Predictive Modeling of Weather Station Data:
Linear Regression vs. Graph Neural Networks
Introduction
This section will be expanded as the modeling process is further refined
Accurate weather prediction is a crucial task with widespread implications for agriculture, transportation, disaster preparedness, and energy management. Traditional forecasting methods often rely on statistical models or physics-based simulations; however, with the advancement of graph neural networks (GNNs), we believe a more modern deep learning approach holds potential.
In this project, we explore the predictive power of a traditional linear regression model and a GNN on real-world weather station data. Our aim is to evaluate whether the GNN's ability to incorporate spatial relationships between stations offers a measurable advantage over more conventional techniques.
The dataset consists of multiple weather stations located within the same geographic region. Each station collects meteorological variables over time and can be represented as a node within a broader spatial network. For the linear baseline, a single model will be trained on all stations' data simultaneously, treating each station as an independent feature source.
For the GNN, the model will be trained on the entire network of stations, where each node corresponds to a station and edges represent spatial relationships. The graph is encoded via a dense adjacency matrix, excluding self-connections. The GNN aims to leverage the inherent spatial structure of the data, potentially capturing regional weather patterns and inter-station dependencies that are invisible to traditional models.
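As a minimal sketch of this graph encoding, the dense adjacency matrix without self-connections can be built as below. This assumes an unweighted, fully connected station graph; the project's actual edge weighting may differ, and the station list is taken from the data preview later in this report.

```python
import numpy as np

# Station identifiers from the Kansas ASOS data shown below.
stations = ["GCK", "LBL", "EHA", "HQG", "3K3", "JHN", "19S"]
n = len(stations)

# Dense adjacency matrix for a fully connected graph,
# with a zero diagonal so stations have no self-connections.
adj = np.ones((n, n)) - np.eye(n)

print(adj.shape)    # (7, 7)
print(adj.trace())  # 0.0 -- no self-loops
```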
Our evaluation focuses on forecasting performance over a 6-month test period at the end of the dataset. We assess how well each modeling approach predicts key weather variables and investigate the conditions under which one model may outperform the other.
Methods
This section will be expanded as the modeling process is further refined
This section outlines the modeling approaches, data structure, and training procedures used to compare the traditional linear model and the GNN on weather station data.
1. Data selection
Work in progress
2. Cleaning Process
Work in progress
3. Linear Model
The linear model is formulated as a time-series regression task: it uses the feature values from the previous four time steps to predict the feature values at the next time step. Each input is a concatenation of the five meteorological features across four sequential time steps, yielding a fixed-length input vector of 20 values per prediction target. The five input features are:
- Temperature
- Relative humidity
- Wind speed
- Wind direction (sine component)
- Wind direction (cosine component)
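A minimal sketch of this windowed setup, with synthetic data standing in for a station's feature series (the lag length of four and the five features match the description above; variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3435)  # seed value reused from the analysis code
n_steps, n_features, lags = 200, 5, 4

# Synthetic stand-in for one station's features:
# temperature, relative humidity, wind speed, wind-direction sin/cos.
series = rng.normal(size=(n_steps, n_features))

# Lagged design matrix: each row concatenates the previous
# four time steps (4 * 5 = 20 inputs) to predict the next step.
X = np.stack([series[t - lags:t].ravel() for t in range(lags, n_steps)])
y = series[lags:]

model = LinearRegression().fit(X, y)
preds = model.predict(X)
mse = np.mean((preds - y) ** 2)
print(X.shape, y.shape)  # (196, 20) (196, 5)
```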
4. GNN
The GNN is designed to capture spatiotemporal dependencies in the weather station network. It is implemented using PyTorch and follows a structure inspired by the Diffusion Convolutional Recurrent Neural Network (DCRNN) architecture.
- Architecture
  - Input format: data is structured using the StaticGraphTemporalSignal format, where each node represents a weather station and temporal sequences of node features are used for prediction.
  - Layers:
    - A DCRNN layer to capture spatial and temporal dependencies
    - A ReLU activation function
    - A linear output layer for the final prediction
- Training configuration
  - Optimizer: Adam
  - Learning rate: base learning rate of 0.01, reduced by a factor of 0.1 on plateau
  - Epochs: trained for a maximum of 100 epochs with an early-stopping callback
The model is trained to predict the same five features (temperature, relative humidity, wind speed, wind direction sin, wind direction cosine) for the next time step based on the preceding four time steps, analogous to the linear model.
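The plateau-based learning-rate reduction and early exit described above can be sketched in plain Python. This is only an illustration of the scheduling logic; the patience values here are assumptions, and the actual run would use the corresponding PyTorch utilities.

```python
# Sketch of the training loop's scheduling logic: start at lr = 0.01,
# multiply by 0.1 when the loss plateaus, and stop early if no
# improvement is seen for `stop_patience` epochs.
def train_schedule(losses, lr=0.01, factor=0.1,
                   lr_patience=3, stop_patience=6, max_epochs=100):
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(losses[:max_epochs]):
        if loss < best - 1e-6:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best == lr_patience:    # plateau reached: cut the rate
            lr *= factor
        if since_best >= stop_patience:  # early exit
            return epoch + 1, lr
    return min(len(losses), max_epochs), lr

# A loss curve that improves and then flattens:
epochs_run, final_lr = train_schedule(
    [1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4])
print(epochs_run, round(final_lr, 6))  # 9 0.001
```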
Analysis and Results
Data Exploration and Visualization
```python
import datetime
import polars as pl

start_date = datetime.datetime(2010, 1, 1, 0, 0)
end_date = datetime.datetime(2020, 12, 31, 0, 0)
seed = 3435
split_index = 730
pl.enable_string_cache()
data_path = r'kansas_asos_2010_2020.csv'
```

| station | valid | lat | lon | elevation | tmpf | dwpf | relh | sknt | feel | drct_sin | drct_cos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f64 | f64 |
| "GCK" | 2018-01-01 00:00:00 | 37.927502 | -100.724403 | 881.0 | 9.283334 | -6.5 | 48.148335 | 8.666667 | -4.161667 | 0.851117 | 0.524977 |
| "LBL" | 2018-01-01 00:00:00 | 37.044201 | -100.9599 | 879.0 | 12.316667 | -1.983333 | 52.014999 | 10.166667 | -1.721667 | 0.664796 | 0.747025 |
| "EHA" | 2018-01-01 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 15.555555 | 5.255556 | 63.382778 | 7.388889 | 4.482222 | 0.970763 | 0.24004 |
| "HQG" | 2018-01-01 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 14.311111 | -1.605556 | 48.468334 | 7.777778 | 2.681667 | 0.936332 | 0.351115 |
| "3K3" | 2018-01-01 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 13.1 | -0.9 | 53.127777 | 6.777778 | 2.151111 | 0.981255 | 0.192712 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "EHA" | 2020-12-31 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 42.355556 | 16.588888 | 34.955555 | 3.0 | 40.254444 | -0.533205 | -0.845986 |
| "HQG" | 2020-12-31 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 40.722221 | 13.255555 | 32.23111 | 1.666667 | 39.450001 | 0.977334 | -0.211704 |
| "3K3" | 2020-12-31 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 40.200001 | 12.2 | 31.377777 | 4.555555 | 36.683334 | -0.700217 | -0.71393 |
| "JHN" | 2020-12-31 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | 40.711113 | 17.4 | 38.554443 | 4.777778 | 37.06889 | -0.824675 | -0.565607 |
| "19S" | 2020-12-31 00:00:00 | 37.496899 | -100.832901 | 892.570007 | 39.922222 | 15.388889 | 36.552223 | 6.111111 | 35.113335 | -0.824675 | 0.565607 |
| station | tmpf | relh | sknt | drct_sin | drct_cos |
|---|---|---|---|---|---|
| cat | f64 | f64 | f64 | f64 | f64 |
| "GCK" | -1.355721 | 0.481483 | -0.080357 | 0.851117 | 0.524977 |
| "LBL" | -1.265174 | 0.52015 | 0.160714 | 0.664796 | 0.747025 |
| "EHA" | -1.168491 | 0.633828 | -0.285714 | 0.970763 | 0.24004 |
| "HQG" | -1.205638 | 0.484683 | -0.223214 | 0.936332 | 0.351115 |
| "3K3" | -1.241791 | 0.531278 | -0.383929 | 0.981255 | 0.192712 |
| … | … | … | … | … | … |
| "EHA" | -0.368491 | 0.349556 | -0.991072 | -0.533205 | -0.845986 |
| "HQG" | -0.417247 | 0.322311 | -1.205357 | 0.977334 | -0.211704 |
| "3K3" | -0.432836 | 0.313778 | -0.741072 | -0.700217 | -0.71393 |
| "JHN" | -0.417579 | 0.385544 | -0.705357 | -0.824675 | -0.565607 |
| "19S" | -0.441128 | 0.365522 | -0.491072 | -0.824675 | 0.565607 |
RecurrentGCN(
(recurrent1): DCRNN(
(conv_x_z): DConv(204, 64)
(conv_x_r): DConv(204, 64)
(conv_x_h): DConv(204, 64)
)
(recurrent2): DCRNN(
(conv_x_z): DConv(96, 32)
(conv_x_r): DConv(96, 32)
(conv_x_h): DConv(96, 32)
)
(recurrent3): DCRNN(
(conv_x_z): DConv(64, 32)
(conv_x_r): DConv(64, 32)
(conv_x_h): DConv(64, 32)
)
(linear): Linear(in_features=32, out_features=1, bias=True)
)
MSE: 0.0562
LinearRegression()
MSE: 0.0147
[Figure: LR Actual vs Predicted Over Time for Each Node — one panel per station (GCK, JHN, LBL, HQG, 19S, EHA, 3K3)]
[Figure: LR Absolute Error for Each Station — one panel per station (GCK, JHN, LBL, HQG, 19S, EHA, 3K3)]
[Figure: GNN vs. LR Absolute Error for Each Station — one panel per station (GCK, JHN, LBL, HQG, 19S, EHA, 3K3)]
Modeling and Results
Conclusion